5 summarise clusters #6
Conversation
- Add cluster naming/summarisation step
- Add basic semantic chunking script for comparison with other datasets
- There is now one set of scripts for a bottom-up analysis
- A streamlit app to visualise the output of the bottom-up analysis
- One script to execute a top-down analysis
Thank you Rosie, I've looked through most of the code and test-ran it - it's all good!
I've made a few small comments (in addition to the ones I sent over Slack) - take a look and see if you'd like to address them now.
Otherwise, happy for you to merge and continue with the evals and dashboard.
I also added a few tests that helped me better understand the data processing functions. Didn't have time to do "test-driven reviewing" for the rest of the code, however.
```python
# Fix improperly coded characters
data_df['text_clean'] = data_df['text'].apply(lambda x: ftfy.fix_text(x))
```

Suggested change:

```python
data_df["text_clean"] = data_df["text"].apply(lambda x: ftfy.fix_text(x))
```
Nice, I didn't know about `ftfy.fix_text`
```python
def name_topics(
    topic_info: pd.DataFrame,
    llm_chain,
    topics: List[str],
```
I guess this `topics` variable could be removed; you could move the line `topics = topic_info["Cluster"].unique().tolist()` inside this function, to simplify the input variables.
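The suggested refactor might look like this - a sketch only, assuming a LangChain-style `llm_chain` with an `invoke` method and the `Cluster`/`Top Words` columns from the excerpts above; the function body is hypothetical, not the PR's actual implementation:

```python
from typing import Dict

import pandas as pd


def name_topics(topic_info: pd.DataFrame, llm_chain) -> Dict:
    """Generate a name for each topic. The topic list is derived
    inside the function rather than passed in as an argument."""
    # Previously passed in as a separate `topics: List[str]` argument
    topics = topic_info["Cluster"].unique().tolist()
    names = {}
    for topic in topics:
        top_words = topic_info.loc[topic_info["Cluster"] == topic, "Top Words"].iloc[0]
        names[topic] = llm_chain.invoke({"top_words": top_words})
    return names
```

Callers then only need the DataFrame and the chain, and the two arguments can't get out of sync.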
```python
if __name__ == "__main__":

    errors = []
```
Are you doing anything with this list in this script?
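If the list is kept, a common pattern is to collect per-item failures and report them at the end instead of aborting on the first error. A hypothetical sketch (the `items`/`process` names stand in for whatever the script iterates over, not its actual logic):

```python
errors = []
results = []

# Stand-ins for the script's real work items and processing function
items = ["a", "b", "c"]

def process(item):
    if item == "b":
        raise ValueError(f"could not process {item!r}")
    return item.upper()

for item in items:
    try:
        results.append(process(item))
    except ValueError as e:
        # Record the failure and carry on with the remaining items
        errors.append(str(e))

if errors:
    print(f"{len(errors)} item(s) failed: {errors}")
```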
```python
def match_questions(bot_qs_list: List[str], questions: List[str], threshold: float = 0.85) -> pd.DataFrame:
    """Match the input questions from our interview template,
    and the actual questions produced by the bot, using cosine similarity.
```
Suggested change:

```diff
-    and the actual questions produced by the bot, using cosine similarity.
+    and the actual sentences produced by the bot, using cosine similarity.
```
I guess you're not just matching questions but in fact all sentences?
Hi, yes this is correct! Good spot
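For readers unfamiliar with the approach, here is a self-contained sketch of matching bot sentences to template questions by cosine similarity. It uses a toy word-count embedding so it runs without a model; the real pipeline presumably uses proper sentence embeddings, and all helper names here are illustrative:

```python
from typing import Callable, List

import numpy as np
import pandas as pd


def match_sentences(bot_sentences: List[str], questions: List[str],
                    embed: Callable[[str], np.ndarray],
                    threshold: float = 0.85) -> pd.DataFrame:
    """For each bot sentence, keep the best-matching template question
    if its cosine similarity clears the threshold."""
    rows = []
    q_vecs = [embed(q) for q in questions]
    for sent in bot_sentences:
        v = embed(sent)
        # Cosine similarity against every template question
        sims = [float(np.dot(v, q) / (np.linalg.norm(v) * np.linalg.norm(q)))
                for q in q_vecs]
        best = int(np.argmax(sims))
        if sims[best] >= threshold:
            rows.append({"sentence": sent,
                         "matched_question": questions[best],
                         "similarity": sims[best]})
    return pd.DataFrame(rows)


# Toy embedding: counts over a tiny fixed vocabulary (illustrative only;
# zero vectors would need guarding in real use)
VOCAB = ["food", "heating", "cost", "home"]

def toy_embed(text: str) -> np.ndarray:
    words = text.lower().split()
    return np.array([words.count(w) for w in VOCAB], dtype=float)
```

With real sentence embeddings you would swap `toy_embed` for a model encoder and keep the matching logic unchanged.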
```python
interviews_cleaned_df = interviews_df.groupby("conversation").apply(remove_preamble).reset_index(drop=True)
```
It doesn't look like you're using or saving `interviews_cleaned_df` after this...
Also a good spot - thank you!!
```python
topic_counts = pd.DataFrame(data["Cluster"].value_counts()).reset_index()
topic_counts = topic_counts.rename(columns={"count": "N responses in topic"})
```
Minor comment, but I would like us to use pandas chaining as much as possible, as it makes the code easier to read:

```python
topic_counts = (
    pd.DataFrame(data["Cluster"].value_counts())
    .reset_index()
    .rename(columns={"count": "N responses in topic"})
)
```
```python
data_w_names = data_w_names.rename(columns={"llama3.2_name": "Name", "llama3.2_description": "Description"})
data_w_names = pd.merge(data_w_names, topic_counts, left_on="Cluster", right_on="Cluster", how="left")

data_w_names[["Name", "Description", "Top Words", "N responses in topic"]].to_csv(OUTPUT_PATH_SUMMARY, index=False)
```
Same point here about chains.
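A chained version of the rename-and-merge above might look like this - a sketch on toy data, with illustrative column values (since both frames share the `Cluster` key, the `DataFrame.merge` method with `on="Cluster"` replaces the `pd.merge(..., left_on=..., right_on=...)` call):

```python
import pandas as pd

data_w_names = pd.DataFrame({
    "Cluster": [0, 1],
    "llama3.2_name": ["Costs", "Heating"],
    "llama3.2_description": ["About costs", "About heating"],
})
topic_counts = pd.DataFrame({"Cluster": [0, 1], "N responses in topic": [5, 3]})

# Rename the LLM-generated columns, then attach the per-topic counts
data_w_names = (
    data_w_names
    .rename(columns={"llama3.2_name": "Name", "llama3.2_description": "Description"})
    .merge(topic_counts, on="Cluster", how="left")
)
```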
```python
data_viz = pd.merge(data, data_w_names[["Cluster", "Name", "Description"]], on="Cluster", how="left")

data_viz["Name"] = data_viz["Name"].fillna("None")
data_viz["Description"] = data_viz["Description"].fillna("None")
```
You can also use chains here, for example with `.assign(Name=lambda df: df["Name"].fillna("None"))`.
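Concretely, the two `fillna` lines could become one chained expression - a sketch on toy data:

```python
import pandas as pd

data_viz = pd.DataFrame({
    "Name": ["Topic A", None],
    "Description": [None, "Responses about heating"],
})

# Fill missing names/descriptions in a single chained step
data_viz = data_viz.assign(
    Name=lambda df: df["Name"].fillna("None"),
    Description=lambda df: df["Description"].fillna("None"),
)
```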
…ew_transcripts into 5-summarise-clusters
Co-authored-by: Karlis Kanders <[email protected]>
Fixes issues #5 (topic summarisation) and #7 (semantic chunking)
To review, run

```shell
python dsp_interview_transcripts/pipeline/run_pipeline.py
```

or follow the instructions in this readme. Check that everything runs and the outputs are as expected. There is also a notebook on semantic chunking, but only look at that if you are interested - we're not taking that route for now.